Contributor

@schmikei schmikei commented Nov 17, 2025

Some of the screenshots are missing data mostly due to me not setting up sharding, but queries/functionality should be pretty similar to the original.

MongoDB Atlas cluster overview
[screenshot]

Paginated the tables 🚀
[4 screenshots]

MongoDB Atlas elections overview
[2 screenshots]

MongoDB Atlas operations overview
[3 screenshots]

MongoDB Atlas performance overview
[3 screenshots]

MongoDB Atlas sharding overview
[3 screenshots]

  - alert: MongoDBAtlasElectionTimeouts
    annotations:
-     description: The number of elections being called due to the primary node timing out in replica set {{$labels.rs_m}} in cluster {{$labels.cl_name}} is {{printf "%.0f" $value}} which is above the threshold of 10.
+     description: The number of elections being called due to the primary node timing out in replica set {{$labels.rs_nm}} in cluster {{$labels.cl_name}} is {{printf "%.0f" $value}} which is above the threshold of 10.
Contributor Author

Identified a typo here

@schmikei schmikei marked this pull request as ready for review November 18, 2025 19:08
@schmikei schmikei requested a review from a team as a code owner November 18, 2025 19:08
Member

@Dasomeone Dasomeone left a comment

Have to leave it here as it's end of day, but first pass review sort of done.

Generally, layout is A+ and I'm perfectly happy with it.
A couple of suggestions for improvements in terms of legend tables and filtering, and I have yet to do a pass on the usage of common-lib, so no comments there yet.

dashboardTimezone: 'default',
dashboardRefresh: '1m',
// Basic filtering - MongoDB Atlas uses job and cl_name (cluster name) as primary filters
filteringSelector: 'job="integrations/mongodb-atlas"',
Member

As we've recently talked about, we can vendor latest logs-lib and unset this for the public mixin

alertsDeadlocks: 10, // count
alertsSlowNetworkRequests: 10, // count
alertsHighDiskUsage: 90, // %
alertsSlowHardwareIO: 3, // seconds
Member

Like I commented on a previous PR, we could consider having the units be more tightly coupled with the native metric unit, e.g. milliseconds in order to simplify the query

{
alert: 'MongoDBAtlasSlowHardwareIO',
expr: |||
(sum without (disk_name) (increase(hardware_disk_metrics_read_time_milliseconds[5m])) + sum without (disk_name) (increase(hardware_disk_metrics_write_time_milliseconds[5m]))) / 1000 > %(alertsSlowHardwareIO)s
Member

Like I commented on a previous PR, we could consider having the units be more tightly coupled with the native metric unit, e.g. milliseconds in order to simplify the query
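
One possible reading of this suggestion, sketched in plain Jsonnet: keep the threshold in milliseconds so the expression can compare against the counters directly and the `/ 1000` conversion goes away. The standalone `_config` object and the `3000` value (the millisecond equivalent of the current 3-second threshold) are assumptions for illustration, not what the PR ships.

```jsonnet
// Sketch only: the threshold is stored in milliseconds so it can be
// compared directly against the millisecond counters.
{
  _config:: {
    // Assumption: 3000 ms mirrors the current 3-second threshold.
    alertsSlowHardwareIO: 3000,  // milliseconds
  },

  alert: 'MongoDBAtlasSlowHardwareIO',
  expr: |||
    sum without (disk_name) (increase(hardware_disk_metrics_read_time_milliseconds[5m]))
    + sum without (disk_name) (increase(hardware_disk_metrics_write_time_milliseconds[5m]))
    > %(alertsSlowHardwareIO)s
  ||| % $._config,
}
```

In PromQL, `+` binds more tightly than `>`, so the outer parentheses from the original expression are not needed once the division is removed.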

Member

Given the amount of panels and dashboards, could you please split this panel file into logical groups similar to the structure used by the Kafka and SNMP observability libraries

Comment on lines +167 to +188
hardwareIO:
commonlib.panels.generic.timeSeries.base.new('Hardware I/O', targets=[
signals.cluster.diskReadCount.asTarget(),
signals.cluster.diskWriteCount.asTarget(),
])
+ g.panel.timeSeries.panelOptions.withDescription("The number of read and write I/O's processed.")
+ g.panel.timeSeries.standardOptions.withUnit('iops')
+ g.panel.timeSeries.options.legend.withPlacement('right')
+ g.panel.timeSeries.options.legend.withAsTable(true),

hardwareIOWaitTime:
commonlib.panels.generic.timeSeries.base.new('Hardware I/O wait time / $__interval', targets=[
signals.cluster.diskReadTime.asTarget()
+ g.query.prometheus.withInterval('2m'),
signals.cluster.diskWriteTime.asTarget()
+ g.query.prometheus.withInterval('2m'),
])
+ g.panel.timeSeries.panelOptions.withDescription('The amount of time spent waiting for I/O requests.')
+ g.panel.timeSeries.standardOptions.withUnit('ms')
+ g.panel.timeSeries.options.tooltip.withSort('desc')
+ g.panel.timeSeries.options.legend.withPlacement('right')
+ g.panel.timeSeries.options.legend.withAsTable(true),
Member

For these two panels, I think it'd be beneficial if we make use of the table legend options to add last*, min, mean, and max columns here. Should be available via standard options, can't remember off the top of my head, but Gabriel just used it in the postgres mixin last week
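
A minimal sketch of what those legend columns could look like, written as a plain Jsonnet mixin over the panel JSON rather than through a specific library helper. The field names follow Grafana's time series panel options; `legendTableWithCalcs` and the stand-in `hardwareIO` panel are hypothetical names used only for this example. Grafonnet should expose an equivalent setter under `options.legend`, but that is an assumption worth checking against the vendored version.

```jsonnet
// Sketch only: a reusable mixin that turns the legend into a table with
// summary columns. Field names follow Grafana's time series panel JSON.
local legendTableWithCalcs = {
  options+: {
    legend+: {
      displayMode: 'table',
      placement: 'right',
      // 'lastNotNull' is the reducer that Grafana labels "Last *".
      calcs: ['lastNotNull', 'min', 'mean', 'max'],
    },
  },
};

// Hypothetical stand-in for a panel that is built elsewhere.
local hardwareIO = { type: 'timeseries', title: 'Hardware I/O' };

hardwareIO + legendTableWithCalcs
```

Because the mixin uses `+:` on `options` and `legend`, it merges into whatever options the panel already carries instead of replacing them.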

+ g.panel.timeSeries.standardOptions.withUnit('reqps')
+ g.panel.timeSeries.options.tooltip.withSort('desc'),

networkThroughput:
Member

+1 for last*, and at least mean as additional data columns for a quick overview on the side

Comment on lines +355 to +357
//
// Elections panels
//
Member

For these election panels, with multiple series per instance monitoring we may need to do some filtering on the 0 values, though I'm also worried about discarding good data. @schmikei @aalhour any ideas here?
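
For reference, the simplest form of that filtering is a PromQL comparison, sketched here as a small Jsonnet helper. `hideZeros` is a hypothetical name, and the expression passed to it reuses a metric from the alert rule in this PR purely as an example, since the election metric names are not shown in this thread.

```jsonnet
// Sketch only: wrap a PromQL expression so that samples whose value
// is 0 are dropped from the result.
local hideZeros(expr) = '(%s) > 0' % expr;

{
  // Metric name taken from the alert rule in this PR, used only as an example.
  expr: hideZeros('sum without (disk_name) (increase(hardware_disk_metrics_read_time_milliseconds[5m]))'),
}
```

A comparison without the `bool` modifier filters samples rather than returning 0 or 1, so a series that is zero for part of the range is drawn with gaps and a series that is zero throughout disappears from the panel. That is the trade-off behind the concern about discarding good data.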

+ g.panel.timeSeries.standardOptions.withUnit('reqps')
+ g.panel.timeSeries.options.tooltip.withSort('desc'),

slowNetworkRequestsPerformance:
Member

+1 for this panel, and for networkThroughputPerformance in general: both may need some filtering, since the unfiltered data is affecting, and will keep affecting, the y-axis scaling

+ g.panel.timeSeries.options.legend.withPlacement('right')
+ g.panel.timeSeries.options.legend.withAsTable(true),

hardwareIOWaitTime:
Member

+1 filtering, I'll stop commenting on them now
